An annotated chromosome-scale reference genome for Eastern black-eared wheatear (Oenanthe melanoleuca)

Abstract Pervasive convergent evolution and in part high incidences of hybridization distinguish wheatears (songbirds of the genus Oenanthe) as a versatile system to address questions at the forefront of research on the molecular bases of phenotypic and species diversification. To prepare the genomic resources for this venture, we here generated and annotated a chromosome-scale assembly of the Eastern black-eared wheatear (Oenanthe melanoleuca). This species is part of the Oenanthe hispanica complex that is characterized by convergent evolution of plumage coloration and high rates of hybridization. The long-read-based male nuclear genome assembly comprises 1.04 Gb in 32 autosomes, the Z chromosome, and the mitogenome. The assembly is highly contiguous (contig N50, 12.6 Mb; scaffold N50, 70 Mb), with 96% of the genome assembled at the chromosome level and 95.5% benchmarking universal single-copy orthologs (BUSCO) completeness. The nuclear genome was annotated with 18,143 protein-coding genes and 31,333 mRNAs (annotation BUSCO completeness, 98.0%), and about 10% of the genome consists of repetitive DNA. The annotated chromosome-scale reference genome of Eastern black-eared wheatear provides a crucial resource for research into the genomics of adaptation and speciation in an intriguing group of passerines.


Introduction
Wheatears of the genus Oenanthe and their relatives-together referred to as "open-habitat chats"-are a group of songbirds that display several remarkable characteristics distinguishing them as a versatile system to address key questions on the evolution of phenotypes and formation of species. Many phenotypes, including multiple conspicuous color ornaments, seasonal migration, and sexual dimorphism, appear independently in multiple branches within open-habitat chats, suggesting a high incidence of convergent evolution (Aliabadian et al. 2012;Alaei Kakhki et al. 2013;Schweizer et al. 2019aSchweizer et al. , 2019b. Furthermore, hybridization is observed in several species complexes and occurs at notably high rates in the Oenanthe hispanica complex that consists of 4 currently recognized taxa (Schweizer et al. 2019a(Schweizer et al. , 2019b: Western black-eared wheatear (O. hispanica), pied wheatear (Oenanthe pleschanka), cyprus wheatear (Oenanthe cypriaca), and Eastern black-eared wheatear (Oenanthe melanoleuca; Fig. 1). Pied and Eastern black-eared wheatear hybridize pervasively at the western shores of the Black Sea, in the Caucasus, and in the Alborz mountains of northern Iran (Haffer 1977;Panov 2005). The resulting introgression reaches beyond the hybrid zones (Schweizer et al. 2019a(Schweizer et al. , 2019b, and hybrid zones themselves sport admixed phenotypes that display combinations of plumage color phenotypes divergent between species (mantle and neck-side coloration) (Haffer 1977;Panov 2005). Finally, a phenotype divergently expressed between many wheatear species, black-or-white throat coloration, segregates as polymorphisms in 3 species of the O. hispanica complex. Once a high-quality reference genome is available, this polymorphism and the recombination of mantle and neck-side coloration in hybrids provide an excellent opportunity to map these phenotypes to the genome (Buerkle and Lexer 2008) and study their convergent evolution across openhabitat chats. Furthermore, hybridization in several geographic regions enables insights into common or idiosyncratic patterns of evolution under hybridization (Gompert et al. 2017).
Here, we describe the de novo assembly and annotation of a chromosome-scale reference genome for the Eastern black-eared wheatear (O. melanoleuca). The assembly includes models for 32 autosomes, the Z chromosome, and the mitogenome that together cover 90% of the k-mer-based genome size estimate (94% with unplaced scaffolds included); it is highly contiguous with a scaffold N50 of 70 Mb and benchmarking universal single-copy orthologs (BUSCO) completeness score of 95.5%. This reference genome enables genomic research into the evolutionary history of phenotypic and species diversification in wheatears and their close relatives.

Sampling, tissue preservation, and nucleic acid extraction
To obtain optimal starting material for a reference individual, we freshly sampled a male Eastern black-eared wheatear (O. melanoleuca) well outside known hybrid zones (Haffer 1977;Panov 2005) in Galaxidi, Greece (sampling permit no. 181968/989, issued by the Ministry of Environment and Energy, General Secretariat of Environment, General Directorate of Forests and Forest Environment, Directorate of Forest Management, and Department of Wildlife and Game Management; export permit no. 55980/1575, Regional CITES Management Authority Attika). For this purpose, we sampled about 100 µL of blood from the brachial vein, and, after euthanizing the bird, we extracted all tissues possible. Tissues were immediately snap-frozen in liquid nitrogen. Throughout transportation and storage preceding DNA extraction, the samples were kept at a temperature below −80°C.
To obtain ultra-high molecular weight (UHMW) DNA from the reference individual, NGI Uppsala (Sweden) extracted DNA from the blood sample using the Bionano Prep Blood and Cell Culture DNA Isolation Kit (Bionano, San Diego, USA). Electrophoresis on a Femto Pulse instrument showed a mean DNA fragment length of about 200 kb, with fragments reaching up to 800 kb.
To prepare muscle tissue for Hi-C sequencing library preparation, we pulverized breast muscle tissue from the reference individual in a mortar. To avoid unfreezing of the tissue powder, the procedure was carried out in a climate chamber at 4°C under regular addition of liquid nitrogen.
To prepare RNA for full-length transcript sequencing, we extracted total RNA from 8 snap-frozen tissues kept at −80°C (brain, breast muscle, heart, kidney, liver, lung, spleen, and testis) using the RNeasy Mini Kit (Qiagen; Hombrechtikon, Switzerland) according to the manufacturer's instructions. RNA quality was assessed with a Fragment Analyzer (Agilent). RNA from spleen showed considerable degradation and was excluded from further analyses.

Assembly strategy and data acquisition
To obtain a chromosome-scale reference genome, our strategy largely followed the multiplatform approach recommended by Peona et al. (2021a). In brief, it consisted of (1) a phased primary assembly based on long reads, (2) polishing and scaffolding of the primary assembly with linked-read sequencing data, and (3) scaffolding of the secondary assembly with proximity ligation (Hi-C) information.
To this end, we obtained a total of 215-Gb (unique coverage 151 Gb) Pacific Biosciences (PacBio) long-read sequence data, 54-Gb linked-read sequence data, and 83-Gb Hi-C data. NGI Uppsala (Sweden) prepared a PacBio library from UHMW DNA using the SMRTbell Template Prep Kit 1.0 and sequenced this library on 18 SMRT Cells 1M v3 on a PacBio Sequel instrument (Sequel Binding Kit 3.0, Sequel Sequencing Plate 3.0). PacBio longread data was initially processed using SMRT Link v6. A linkedread sequencing library was prepared using the 10× Genomics Chromium Genomic Kit (from the same DNA extraction as used for PacBio sequencing; 10× Genomics, Inc., Pleasanton, CA, USA; Cat No. 120215), and a Hi-C library was prepared with the Dovetail Omni-C kit (Scotts Valley, CA, USA; Cat No. 21005). The linked-read and Hi-C libraries were prepared and sequenced on a NovaSeq 6000 instrument (S4 lane, 150-bp paired-end reads) at the facilities of NGI Stockholm (Sweden).

Genome size estimation
We estimated genome size by counting k-mer frequency of the quality-checked 10× Genomics linked reads. To this end, we first trimmed 22 bp from all 10× Genomics linked reads using fastp (Chen et al. 2018) to remove indices from R1 reads and keep symmetric read lengths for the R2 reads. We then counted k-mers of size 21 using jellyfish 2.2.10 (Marçais and Kingsford 2011) and used GenomeScope (Vurture et al. 2017) to estimate genome size from k-mer count histograms.

De novo genome assembly
We assembled the PacBio long reads into the phased primary assembly using the Falcon Unzip 0.5 assembler (Chin et al. 2016), followed by polishing with Arrow 1.9.0. Before assembly polishing, we masked repeat regions of the phased primary assembly with RepeatMasker 4.1.0 (Smit et al. 1996(Smit et al. -2010 using a custom repeat library Boman et al. 2019;Weissensteiner et al. 2020;Peona et al. 2021aPeona et al. , 2021b to make accurate assembly corrections without overcorrecting large repeats. We then polished the masked assembly with 2 rounds of Pilon v1.22 (Walker et al. 2014) with the parameter "-fix indels" using the reference individual's linked-read data. To purge duplicate scaffolds from the assembly, we ran purge_dups 1.2.5 (Guan et al. 2020) on the polished assembly. Prior to scaffolding with linked-read data, we split potential mis-assemblies with reference-individual linked-read data using Tigmint 1.2.4 (Jackman et al. 2018). With the aim to scaffold the polished remaining contigs, we applied ARCS 1.2.2 andLINKS 2.0.0 using the reference individual's linked-read data using default parameters (Warren et al. 2015;Yeo et al. 2018).
To further scaffold the assembly, we applied the 3D DNA pipeline (Dudchenko et al. 2017) to join the sequences into chromosomes. We first used Juicer v.1.6 (Durand et al. 2016) to map Hi-C data against the contigs and to filter reads and then ran the asm-pipeline v.180922 to generate a draft scaffolding.
Finally, we corrected mis-assemblies based on the visual inspection of the proximity map using Juicebox 2.13.06 (Robinson et al. 2018). The final chromosome-level assembly was polished with 2 additional rounds of Pilon as described above.
To assess homology of the assembled scaffolds with bird chromosomes, we aligned the final genome assembly to the genomes of collared flycatcher (Ficedula albicollis) (FicAlb1.5) (Kawakami et al. 2014a(Kawakami et al. , 2014b, zebra finch (taeGut3.2.4) (Warren et al. 2010), and chicken (GRCg6a) (Bellott et al. 2017) using D-GENIES (Cabanettes and Klopp 2018). Chromosomes were named according to homology with these 3 genomes. In cases, such as chicken chromosomes 1 and 4 that are split to multiple chromosomes in songbirds, the nomenclature in the wheatear genome was adapted to the species whose homologous chromosome matched closest.

Mitogenome assembly
To assemble the mitochondrial genome, we used the MitoFinder 1.4 (Allio et al. 2020) and mitoVGP 2.2 ) pipelines with the published Oenanthe isabellina mitochondrial genome (GenBank accession number: NC_040290.1) as reference. We ran MitoFinder with the reference individual's short-read data (linked-read data but without making use of the linked-read haplotype information), and with mitoVGP, we made joint use of the linked-read and long-read data. From MitoFinder, we extracted the longest contig containing all 13 protein-coding genes, 2 rRNA genes, and 22 tRNAs annotated by MitoFinder as mitogenome assembly. We annotated both assemblies using the MITOS WebServer (http://mitos2.bioinf.uni-leipzig.de/index.py).

Assembly quality evaluation
To evaluate assembly quality at each assembly step, we estimated basic assembly statistics using QUAST 5.0.2 (Gurevich et al. 2013) and evaluated the completeness of expected gene content in the assembly based on BUSCO (Simão et al. 2015) with the avian data set aves_odb10 (8,338 BUSCO) in BUSCO 5.0.0.

Repeat annotation
The final version of the genome assembly was used to de novo characterize both interspersed and tandem repeats. For interspersed repeats, we used RepeatModeler2 (Flynn et al. 2020) with the option "-LTR_struct" to obtain an improved characterization of LTR retrotransposons which are commonly found in avian genomes Boman et al. 2019;Peona et al. 2021aPeona et al. , 2021b. The resulting library of raw consensus sequences was filtered from consensus sequences of tandem repeats (for which we ran a specific analysis; see below) and from protein-coding genes using the Snakemake pipeline repeatlib_filtering_workflow v0.1.0 (https://github.com/NBISweden/repeatlib_filtering_workflow).
For tandem repeats, we used RepeatExplorer2 (Novák et al. 2020) to search for satellite DNA (satDNA) sequences using the reference individual's 10× Genomics linked reads. Prior to RepeatExplorer2 graph-based clustering analysis, sequencing reads were preprocessed and checked by quality with FastQC (Babraham Bioinformatics: Cambridge 2012) using the public online platform at https://galaxy-elixir.cerit-sc.cz/. We processed the reads with the "quality trimming", "FASTQ interlacer on the paired end reads," and "FASTQ to FASTA converter", followed by "RepeatExplorer2 clustering" tools with default parameters. Each reference sequence assembled by RepeatExplorer2 consisted of a monomer of the satDNA consensus sequence. The relative genomic abundance and nucleotide divergence (Kimura 2-parameter distance) of each detected satDNA were estimated by sampling 4 million read pairs and aligning them to the satDNA library with RepeatMasker 4.1.0 (Smit et al. 1996(Smit et al. -2010. The sampled reads were mapped to dimers of satDNA consensus sequences, and for smaller satDNAs, several monomers were concatenated until reaching roughly 150-bp array length. The resulting RepeatMasker.align file was then parsed to the script calcDivergenceFromAlign.pl from RepeatMasker utils. The relative abundance of each satDNA sequence was then estimated as the proportion of nucleotides aligned with the reference sequence with respect to the total Illumina library size.
The RepeatModeler2 library was then merged with the satDNA library produced here and with known avian consensus sequences of transposable elements (TEs) from Repbase (Bao et al. 2015), Dfam (Storer et al. 2021(Storer et al. , 2021, flycatcher, blue-capped cordon-bleu, hooded crow, and paradise crow Boman et al. 2019;Weissensteiner et al. 2020;Peona et al. 2021aPeona et al. , 2021b. This library was then used to annotate the genome assembly with RepeatMasker (Smit et al. 1996(Smit et al. -2010. The annotation produced was processed with the script calcDivergenceFromAlign.pl from RepeatMasker utils to calculate the divergence between repeats and their consensus sequences using the Kimura 2-parameter distance corrected for the presence of CpG sites.

Full-length transcript sequencing and genome annotation
We aimed to establish a high-quality genome annotation based on full-length transcripts. To this end, for each of the abovementioned 7 tissues, the NGS platform of the University of Bern, Switzerland, prepared an Iso-Seq library using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences). These 7 libraries were then sequenced on 3 separate SMRT cells 8M, sequencing twice 5 tissues (brain and testis, lung, muscle, and heart) and once 2 tissues (liver and kidney) per SMRT cell. Sequencing of these SMRT cells was conducted on a Pacific Biosciences Sequel II instrument at the Genomic Technologies Facility in Lausanne, Switzerland. As the libraries underloaded, 5 libraries (all but liver and kidney) were jointly sequenced on an additional SMRT cell 8M on a Pacific Biosciences Sequel IIe at the NGS platform of the University of Bern.
Circular consensus sequences (CCS), full-length nonchimeric transcripts, and polished high-and low-quality transcripts were obtained by the NGS platform at the University of Bern separately for each run using the IsoSeq 3 pipeline (ICS v10.1). Polished fulllength isoforms for each sequencing run were merged by tissue and then separately mapped to the reference genome using Minimap v2.2 (-ax splice) (Li 2018(Li , 2021. Transcriptome annotations were generated by first collapsing redundant transcripts using TAMA collapse (-x no_cap), before generating open reading frame (ORF) and nonsense-mediated mRNA decay (NMD) predictions using the scripts implemented in TAMA-GO (Kuo et al. 2020) for each of the 7 tissues. We then evaluated tissue-specific transcriptome completeness using BUSCO (Simão et al. 2015) with the avian data set aves_odb10 (8,338 BUSCO) in BUSCO 5.0.0. Additional transcriptome annotation statistics were obtained using the agat_sp_statistics.pl script implemented in the AGAT toolkit (Dainat 2019).
We annotated the repeat soft-masked genome using GeMoMa 1.9 (Keilwagen et al. 2018;Keilwagen et al. 2019), a homology-based gene prediction tool. This tool is based on the annotation of proteincoding genes and intron position conservation in a reference genome to predict the annotation of protein-coding genes in the target genome. We used the genomes of chicken (GCA_016699485.1; International Chicken Genome Sequencing Consortium 2004), zebra finch (GCA_003957565.2; Warren et al. 2010), silvereye (GCA_001281735.1; Cornetti et al. 2015), and collared flycatcher (GCA_000247815.2; Ellegren et al. 2012;Kawakami et al. 2014aKawakami et al. , 2014b as references for the homology-based gene prediction, along with the reference individual's transcriptome obtained from Iso-Seq data to incorporate RNA evidence for the splice prediction. Using the Extract RNA-seq Evidence tool implemented in GeMoMa, we obtained intron position and coverage. This information was fed into the GeMoMa pipeline (GeMoMa.m = 200,000, AnnotationFinalizer.r = SIMPLE, pc = true, and o = true) to obtain predicted protein-coding gene models. To account for redundancies/duplicates resulting from the predicted protein-coding genes potentially stemming from each of the 4 reference species, genome annotation completeness was assessed by recomputing BUSCO using the BUSCOrecomputer tool in GeMoMa.
Functional annotation of protein-coding genes was obtained with . We then merged the predicted protein-coding gene models and the functional annotation using the agat_sp_manage_functional_annotation.pl script, obtained summary statistics using agat_sp_statistics.pl and agat_sp_functional_statistics.pl, both implemented in the AGAT toolkit. Gene ontology (GO terms) were visualized with WEGO 2.0 (wego.genomics.cn).

Nuclear genome assembly
The polished, unzipped primary assembly contained a total of 1,681 contigs, of which all were >25 kb long and 1,610 were >50 kb long (Table 1). Total assembly length was 1.29 Gb, with the longest contig spanning 45.3 Mb, contig N50 of 8.6 Mb, and half of the assembly placed in 35 contigs. Avian BUSCO were 96.9% complete, with 90.6% being single-copy genes (Table 1).
Purging duplicated contigs resulted in an assembly comprised of 381 contigs with a total assembly length of 1.04 Gb, contig N50 of 13.5 Mb, and half of the assembly placed in 23 contigs (Table 1). After this step, BUSCO completeness remained at 96.4%, but an improvement to nearly 96% single-copy BUSCOs was achieved (Table 1).
Starting from an already highly contiguous assembly, the linked-read data did not yield any scaffolding improvement. Still, Tigmint detected several supposed mis-assemblies and split the assembly into 451 scaffolds. However, an alignment of the original contigs in D-GENIES (Cabanettes and Klopp 2018) showed that all but one of the original contigs (see below) were collinear with the collared flycatcher genome. Given this result and that the proximity ligation data would correct mis-assemblies in subsequent steps, we decided to keep the original contigs except for one aligning to flycatcher chromosomes 2 and 3. For the latter contig, we used the output of Tigmint that split the contig in line with the alignment. The 2 split parts covered all but 12,527 bp of the original contig. Visual inspection of the missing sequence showed that it almost entirely consisted of repeats. We left this sequence in the assembly as a separate contig.
The proximity ligation information obtained through Hi-C scaffolding corrected a number of scaffolds, resulting in a higher number of scaffolds (588) than the number of contigs it started from (383). However, the scaffolding yielded a highly contiguous chromosome-scale assembly (N50, 69.6 Mb; L50, 6) with BUSCO completeness of still >96% and almost all BUSCOs in single copy (Table 1). This final assembly contained all macrochromosomes and the majority of microchromosomes usually found in the latest generation of avian genome assemblies Rhie et al. 2021;Peona et al. 2021aPeona et al. , 2021b. A total of 96% of the assembly was placed into chromosome models, and the chromosome-only assembly covered still 95.5% of BUSCO (Table 1). The final assembly length closely matched the one of previous linked-read-based assemblies of the same species and closely related ones (Schweizer et al. 2019a(Schweizer et al. , 2019bLutgen et al. 2020). The genome size estimated from the k-mer distribution of linked-read sequence was between 1.105 and 1.106 Gb, with 0.925-0.926 Gb of unique and 0.179-0.180 Gb (16%) repeat sequence and 0.75-0.76% heterozygosity (GenomeScope model fit 98-99%). The full final reference genome assembly thus covered 94% of the genome size estimate, with 90% of the estimated genome size placed in chromosomes. A total of 96% of the assembly were placed in 33 chromosomes with homologs in collared flycatcher, zebra finch, and chicken, according to which we adapted the chromosome nomenclature. The differences in genome size estimates based on the k-mer approach and the genome assembly length are likely the result of highly repetitive sequences (e.g. centromeres, telomeres, and satDNAs) that collapsed during the assembly process (Peona et al. 2018). Assembly contiguity and completeness (as judged by BUSCO scores) of the O. melanoleuca assembly compared favorably with other songbird genome assemblies (Table 2).

Mitogenome assembly
MitoFinder and MitoVGP assembled mitogenomes of 16,944 bp and 18,631 bp length, respectively. The mitochondrial contigs assembled by the 2 pipelines were congruent, except for 9 single base pair mismatches, for a 1,827-bp-long insert in the MitoVGP assembly and of a 141-bp-long insert in the MitoFinder assembly. We decided to not consider either of these inserts in the final mitogenome assembly for the following reasons. First, neither of the inserts was observed in the mitogenomes of isabelline and northern wheatear. For the long insert in the MitoVGP assembly, moreover, the coverage of short reads mapped to the MitoVGP assembly was strongly reduced (Supplementary Fig. 1), and the insertion constituted a partial duplication of nd6, duplications of 2 tRNAs (Glu and Pro), and a partial duplication of the control region likely caused by an assembly artifact. The short insert in the MitoFinder assembly was not observed in the other wheatear mitogenomes, and if real, we would expect long reads to cover this insert. Because base calling based on short reads is expected to have higher quality, we retained the MitoFinder assembly, but without the 141-bp insert as final mitogenome.
The final mitogenome (as also both original assemblies) contained all 13 protein-coding genes, 2 rRNAs, and 22 tRNAs (Fig. 2). All genes, except 8 tRNAs and nd6, were located on the heavy DNA strand. Both gene order and strandedness were concordant with those observed in northern wheatear (O. oenanthe) ).

Repetitive element annotation
The de novo identification of repetitive elements resulted in the characterization of 572 raw consensus sequences from RepeatModeler2 and 16 satellite DNA consensus sequences from RepeatExplorer2. The consensus sequences from RepeatModeler2 were filtered from tandem repeats and protein-coding genes. This resulted in a final library of 477 consensus sequences (Supplementary File 1). Among these consensus sequences, RepeatModeler2 classified 226 sequences as LTR retrotransposons, 98 as LINE retrotransposons, 21 as DNA transposons, and 5 as SINE retrotransposons, and 112 sequences were unclassified ("unknown").
The genome assembly annotation run with RepeatMasker using the repeat library produced here and merged with already known avian repeats showed that ∼10% of the assembled genome is repetitive (Fig. 3a and Supplementary Table 1 and File 2). This finding indicates that many repeats collapsed during the genome assembly process. An example of this were satDNAs that represented ∼0.8% of the sequenced reads but only < 0.3% of the genome assembly, suggesting that satDNA repeats [such as in (peri-)centromeric and (sub-)telomeric regions] are the most collapsed repeats. Most of the repeats annotated were LTR and LINE retrotransposons (Fig. 3a). While it is common to find LINEs as most abundant TEs in avian genomes Manthey et al. 2018;Galbraith et al. 2021;Peona, Blom et al. 2021a), it is less common to find so similar percentages of LINE and LTR retrotransposons. This is especially true for a male genome assembly such as the present one here that does not include the W chromosome which is highly enriched in LTRs and acts as a refugium for most of the full-length genomic LTR elements in birds (Peona et al. 2021a(Peona et al. , 2021bWarmuth et al. 2022). The TE Table 2. Comparison of genome assembly and annotation summary statistics of O. melanoleuca with other songbird species (J. hyemalis, F. coelebs, M. melodia, T. guttata, F. albicollis, M. vitellinus, and G. fortis). Modified from Friis et al. (2022). BUSCO parameters are C, complete genes; S, complete and single-copy genes; D, complete and duplicated genes; F, fragmented genes; M, missing genes. landscape (Fig. 3b) suggests that LINE retrotransposons experienced a drop in their genomic accumulation in recent times (0-5% divergence; Fig. 3b), whereas LTR retrotransposons kept accumulating at the same rate. Such a recent replacement of LINE retrotransposon activity with a diversity of LTR retrotransposons has been noted in other songbirds and seems to have occurred independently in the so far analyzed passerine families, i.e. estrildid finches (Warren et al. 2010, Boman et al. 2019, flycatchers , crows (Weissensteiner et al. 2020), and birds-of-paradise (Peona et al. 2021a(Peona et al. , 2021b. Finally, the satDNA landscape (Fig. 3b) shows that satDNA arrays experienced differential amplification in copy numbers in recent times (0-10% divergence), implying fast evolution of this genomic fraction in the genome (Peona et al. 2022).

Transcriptome sequencing, genome annotation, and gene function prediction
Iso-Seq sequencing yielded a total of 4,627,382 CCS reads (125,633-1,087,892 reads per tissue, Table 3). This resulted in numbers of high-quality isoforms ranging from 16,078 to 80,600 per tissue. On average 8,833 genes were predicted per tissue, ranging from 4,772 in muscle to 10,924 in the liver. Transcriptome completeness evaluated through BUSCO ranged from 31.2 to 57.5% complete BUSCO per tissue (Table 3). The Iso-Seq transcriptomes were then used as splice evidence in GeMoMa to perform a predominantly homology-based annotation of the reference genome. We predicted 18,143 protein-coding genes with a total of 320,754 exons and 289,421 introns. The number of exons, CDS, and introns was higher for our O. melanoleuca annotation compared with the annotations of other songbirds, such as Junco hyemalis, Fringilla coelebs, Melospiza melodia, Taeniopygia guttata, Ficedula albicollis, Manacus vitellinus, and Geospiza fortis (Table 2). Mean gene length, CDS length, exon length, and number of exons per gene, on the other hand, were in the range of values obtained for the abovementioned songbird annotations (Table 3). Of the 18,143 predicted genes, 17,553 (96.7%) were annotated with protein families or function assignment, and 12,472 (68.7%) genes obtained a GO term assignment through InterProScan. The most abundant GO terms were associated with "cell part," "cell" and "membrane" in the cellular  Repeat annotation landscapes. a) Pie chart summarizing the TE content annotated in the genome assembly. b) TE landscape. The divergence between interspersed repeat copies and their consensus sequences is shown on the X-axis as genetic distance calculated using the Kimura 2-parameter distance. The percentage of the genome assembly occupied by transposable elements is shown on the Y-axis. c) Satellite DNA landscape. The divergence between the satellite DNA consensus sequences and sequences annotated in the short-read library is shown on the X-axis as genetic distance calculated using the Kimura 2-parameter distance. The percentage of the genome (short reads) annotated as satellite DNA is shown on the Y-axis. binding" in the molecular function category, and "cellular metabolic process" or "metabolic process" in the biological process category ( Supplementary Fig. 2). BUSCO completeness of the final annotation as judged from avian BUSCO (n = 8,338) was 98.0%, with 97.4% single-copy BUSCO, 0.6% duplicated BUSCO, 0.6% fragmented BUSCO, and 1.5% missing BUSCO. This suggests an accurate and rather complete annotation.

Data availability
All data, including the assembly, its annotation, and the original sequencing, data are available on the European Nucleotide Archive under project accession PRJNA937434. Code for the repeat analysis is available on https://github.com/ValentinaBoP/ WheatearGenomeAnalysis. Supplemental material available at figshare: https://doi.org/10. 25387/g3.22209697.